On the Consistency of the Crossmatch Test
Abstract

Rosenbaum (2005) proposed the crossmatch test for two-sample goodness-of-fit testing in
arbitrary dimensions. We prove that the test is consistent against all fixed alternatives. In
the process, we develop a consistency result, based on the work of Henze and Penrose (1999),
that applies to a broader class of graph-based tests.


1 Introduction 

Two-sample goodness-of-fit testing is an important line of research in statistics. We consider the
continuous setting where we observe two independent samples, $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$, in $\mathbb{R}^d$.
Classical approaches include the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1939),
the number-of-runs test (Wald and Wolfowitz, 1940), and the longest-run test (Mosteller, 1941).
These procedures were originally designed for real-valued observations ($d = 1$). Over the years,
a number of approaches that apply to vector-valued observations ($d > 1$) have been suggested.
Among these is a class of tests based on graph constructions. This goes back at least to the work
of Friedman and Steppel (1974). Their method is based on counting the number of $X$'s among
the $K$-nearest neighbors of each observation in the combined sample. See also (Rogers, 1976) and
more recently (Hall and Tajvidi, 2002). Although it does not cover all the possibilities, many of
the subsequent proposals can be framed as follows. Let $t = m + n$ denote the total sample size, and
let $\mathcal{G}$ be a directed graph with node set the combined sample $\{Z_1, \dots, Z_t\}$, with $Z_k = X_k$ if $k \le m$
and $Z_k = Y_{k-m}$ if $k > m$. We write $Z_i \to Z_j$ when there is an edge from $Z_i$ to $Z_j$ in $\mathcal{G}$. Consider
rejecting for small values of 

$$\chi_{\mathcal{G}}(Z) = \#\{i \le m, j > m : Z_i \to Z_j\} + \#\{i \le m, j > m : Z_j \to Z_i\}, \tag{1}$$

which is the number of neighbors in the graph from different samples. If the graph $\mathcal{G}$ is the
$K$-nearest neighbor graph (where $Z_i \to Z_j$ if $Z_j$ is among the $K$-nearest neighbors of $Z_i$ in Euclidean
distance) and we assume that all the $Z$'s are distinct, then the resulting test is that of Schilling
(1986), a special case of the general approach of Friedman and Steppel (1974). If the graph $\mathcal{G}$ is a
minimum spanning tree (starting with the complete graph weighted by the Euclidean distances),
then the resulting test is the multivariate runs test of Friedman and Rafsky (1979).
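
As an illustration of the statistic (1), the following minimal Python sketch counts the directed edges that join different samples in a $K$-nearest-neighbor graph. The function name `knn_cross_count`, the representation of points as coordinate tuples, and the brute-force neighbor search are assumptions made for this example only; ties in distance are broken by index order here rather than excluded.

```python
import math


def knn_cross_count(x_sample, y_sample, k_neighbors):
    """Statistic (1) for the directed K-nearest-neighbor graph (illustrative sketch)."""
    points = list(x_sample) + list(y_sample)  # combined sample Z_1, ..., Z_t
    m = len(x_sample)
    t = len(points)
    count = 0
    for i in range(t):
        # Indices of the other points, sorted by Euclidean distance to point i.
        others = sorted(
            (j for j in range(t) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k_neighbors]:
            # The edge i -> j joins different samples when exactly one index is < m.
            if (i < m) != (j < m):
                count += 1
    return count
```

With two well-separated samples every nearest neighbor comes from the same sample and the count is 0; with interleaved samples the count is large, which is why the test rejects for small values.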

If the graph $\mathcal{G}$ is a minimum distance matching, then the resulting test is that of Rosenbaum
(2005), which is the method that we analyze in the present paper, and possibly the latest in this
line. Rosenbaum (2005) calls his method the crossmatch test. The graph $\mathcal{G}$ is a minimal distance
matching of the combined sample, where the distances are understood here as being the Euclidean
distances. In detail, the following optimization problem is solved

$$\min_{\sigma} \sum_{k=1}^{t} \|Z_k - Z_{\sigma(k)}\|, \tag{2}$$

where the minimization is over permutations $\sigma$ of $[t] := \{1, \dots, t\}$ of order 2 with at most one fixed
point, meaning $\sigma(\sigma(k)) = k$ for all $k$ and $\sigma(k) \neq k$ for all $k$ except at most one. (Note that there is
a fixed point if and only if $t$ is odd.) Choosing a solution $\sigma$ at random if there are several, the graph
$\mathcal{G}$ is then defined as the undirected graph with vertex set $[t]$ and edge set $\{(k, \sigma(k)) : \sigma(k) \neq k\}$.
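
To make the construction concrete, here is a small Python sketch that solves (2) exactly by dynamic programming over subsets and then evaluates (1) on the resulting matching. It is only meant for very small samples (the running time grows exponentially in $t$), it is not the algorithm used by Rosenbaum (2005), and the names `optimal_matching` and `crossmatch_statistic` are hypothetical. Ties between optimal solutions are resolved deterministically here, not at random.

```python
import math
from functools import lru_cache


def optimal_matching(points):
    """Minimum total-distance matching as in (2); returns a list of index pairs."""
    t = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    @lru_cache(maxsize=None)
    def best(mask, skip_allowed):
        # mask is a bitmask of the indices that are still unmatched.
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1  # lowest unmatched index
        rest = mask & ~(1 << i)
        best_cost, best_pairs = math.inf, ()
        if skip_allowed:
            # Leave i unmatched: the single fixed point allowed when t is odd.
            best_cost, best_pairs = best(rest, False)
        j_mask = rest
        while j_mask:
            j = (j_mask & -j_mask).bit_length() - 1
            j_mask &= j_mask - 1
            cost, pairs = best(rest & ~(1 << j), skip_allowed)
            cost += dist[i][j]
            if cost < best_cost:
                best_cost, best_pairs = cost, ((i, j),) + pairs
        return best_cost, best_pairs

    _, pairs = best((1 << t) - 1, t % 2 == 1)
    return list(pairs)


def crossmatch_statistic(x_sample, y_sample):
    """Statistic (1) on the optimal matching of the combined sample."""
    points = list(x_sample) + list(y_sample)
    m = len(x_sample)
    pairs = optimal_matching(points)
    cross_pairs = sum((i < m) != (j < m) for i, j in pairs)
    # Each undirected cross pair is counted in both directions in (1).
    return 2 * cross_pairs
```

For realistic sample sizes one would replace the subset recursion with a polynomial-time minimum-weight matching routine; the counting step is unchanged.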

Among methods based on graph constructions, the crossmatch remains poorly understood. 
Rosenbaum (2005) derives the null distribution — which is also studied in Heller et al. (2010). 
While most other tests need to be calibrated by permutation, the crossmatch test has the nice 
feature of having a null distribution (which coincides with its permutation distribution) available 
in closed form. However, the power properties of the test, and in particular its consistency, have 
not been established. 
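
The coincidence of the null distribution with the permutation distribution can be explored on tiny examples by brute force. The sketch below enumerates every assignment of $m$ labels among $t$ points for a fixed matching and tabulates the statistic (1); the function name `permutation_null` and the encoding of a matching as a list of index pairs are assumptions of this example, and the enumeration is feasible only for small $t$.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def permutation_null(pairs, t, m):
    """Exact permutation distribution of statistic (1) for a fixed matching."""
    counts = Counter()
    for x_indices in combinations(range(t), m):
        x_set = set(x_indices)  # indices labeled as coming from the first sample
        cross = sum((i in x_set) != (j in x_set) for i, j in pairs)
        counts[2 * cross] += 1  # factor 2: both edge directions are counted in (1)
    total = sum(counts.values())
    return {value: Fraction(c, total) for value, c in sorted(counts.items())}
```

For instance, with $t = 4$, $m = 2$ and the matching `[(0, 1), (2, 3)]`, the statistic equals 0 with probability 1/3 and 4 with probability 2/3.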

In the present paper, we prove that the crossmatch is consistent against all fixed alternatives.
The proof is based on the arguments developed by Henze and Penrose (1999) to establish the 
consistency of the multivariate runs test of Friedman and Rafsky (1979). Based on their work, we 
develop a general result that applies to graph-based tests where the graph has short edges and 
bounded degree, which in particular applies to other matchings and to the nearest-neighbor graph 
method of Schilling (1986). 

2 Setting 

We observe two independent samples, $X_1, \dots, X_m$ IID with density $f$ and $Y_1, \dots, Y_n$ IID with
density $g$, both with respect to the Lebesgue measure on $\mathbb{R}^d$. The goal is to test

$$H_0 : f = g \quad \text{versus} \quad H_1 : f \neq g. \tag{3}$$

(Of course, this is understood modulo a set of measure zero.) We assume that the sample sizes are
comparable in the sense that

$$\frac{m}{m+n} \to p \in (0,1). \tag{4}$$

Formally, a graph construction is a function $\mathcal{G}$ defined on the finite (unordered) subsets of $\mathbb{R}^d$,
where for $z_1, \dots, z_t \in \mathbb{R}^d$, $\mathcal{G}(z_1, \dots, z_t)$ is a simple^1 directed graph with $t$ nodes. Recalling the
definition of the $Z$'s in the Introduction, the method starts by constructing the graph based on
the combined sample, which means computing $\mathcal{G}(Z)$. Once this is done, the statistic $\chi_{\mathcal{G}}(Z)$, defined
in (1), is computed. The test rejects for small values of $\chi_{\mathcal{G}}(Z)$ and calibration is typically done by
permutation.

Remark 1. Although the literature on graph-based goodness-of-fit testing in arbitrary dimensions 
is silent on the topic, the setting exhibits a typical curse of dimensionality, as we discuss in a 
forthcoming paper (Arias-Castro and Pelletier, 2015). We therefore assume that d is constant. 

3 Almost sure convergence for general graphs 

For $z = \{z_1, \dots, z_t\}$ in $\mathbb{R}^d$, let $\Delta^{\rm out}(z_k; \mathcal{G}(z))$ (resp. $\Delta^{\rm in}(z_k; \mathcal{G}(z))$) denote the out-degree (resp.
in-degree) of $z_k$ in the graph $\mathcal{G}(z)$. We assume that $\mathcal{G}$ has degree bounded by $\delta_0$, meaning that, for any
set of points $z = \{z_1, \dots, z_t\}$ and any $k \in [t]$,

$$\Delta^{\rm out}(z_k; \mathcal{G}(z)) \vee \Delta^{\rm in}(z_k; \mathcal{G}(z)) \le \delta_0. \tag{5}$$

^1 A simple graph has no multi-edges and no self-loops. Such a graph (if unweighted) can be represented by its
adjacency matrix, having only 0's and 1's, with all 0's on the diagonal.

We also assume that the out-degree is essentially constant and that long edges are essentially
absent. Specifically, following (Henze and Penrose, 1999), let $\phi, \phi_1, \phi_2, \dots$ be any density functions
with same support such that $\phi_t/\phi \to 1$ uniformly on $\{\phi > 0\}$. Let $\mathcal{Z}_t = \{Z_{1,t}, \dots, Z_{t,t}\}$ be IID with
density $\phi_t$. Then, under these circumstances, we assume that the out-degree satisfies

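$$\Delta^{\rm out}(Z_{1,t}; \mathcal{G}(\mathcal{Z}_t)) \to \delta \quad \text{in probability}, \quad t \to \infty, \tag{6}$$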

where $\delta > 0$ only depends on $\mathcal{G}$. We also assume that edges of length much larger than $t^{-1/d}$ are
unlikely, in the sense that

$$\lim_{a \to \infty} \lim_{t \to \infty} \mathbb{P}\big[\exists k \in [t] : Z_{1,t} \to Z_{k,t} \text{ in } \mathcal{G}(\mathcal{Z}_t) \text{ and } \|Z_{1,t} - Z_{k,t}\| > a t^{-1/d}\big] = 0. \tag{7}$$

We note that $t^{-1/d}$ is the order of magnitude of the distance between a sample point and its closest
neighbor in the sample.


Remark 2. The conditions above can be shown to cover not only the minimum spanning tree
(as shown by Henze and Penrose (1999)), but also nearest-neighbor graphs (Schilling, 1986) and
general matchings (our main interest here) as shown in Section 4.


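Theorem 1. Assume the graph construction $\mathcal{G}$ satisfies these assumptions. Then, in the limiting
regime (4),

$$\frac{\chi_{\mathcal{G}}(Z)}{m+n} \to 2\delta \int \frac{p(1-p) f(z) g(z)}{p f(z) + (1-p) g(z)} \, dz \quad \text{almost surely}. \tag{8}$$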


The proof of Theorem 1 is exactly the same as that of Theorem 2 in (Henze and Penrose,
1999), treating out-edges and in-edges separately, and with Proposition 1 there replaced with the
following. As in the proof of Theorem 2 in (Henze and Penrose, 1999), Proposition 1 is used in
the proof of Theorem 1 with the choice of $\phi = p f + (1-p) g$ and $\phi_t = (m f + n g)/t$, with $m$ and $n$
implicitly parameterized by $t$. (Recall that $t = m + n$ is the total sample size.)


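Proposition 1. Let $\phi, \phi_1, \phi_2, \dots$ be any density functions with same support such that $\phi_t/\phi \to 1$
uniformly on $\{\phi > 0\}$, let $\mathcal{Z}_t = \{Z_{1,t}, \dots, Z_{t,t}\}$ be IID with density $\phi_t$, and assume the conditions
of Theorem 1 hold. Let $h : \mathbb{R}^d \times \mathbb{R}^d \to [0,1]$ be measurable and such that almost any $z \in \mathbb{R}^d$ is a
Lebesgue continuity point of $h(z, \cdot)\phi(\cdot)$. Then

$$\lim_{t \to \infty} \frac{1}{t} \, \mathbb{E} \sum_{i=1}^{t} \sum_{j=1}^{t} h(Z_{i,t}, Z_{j,t}) \, \mathbb{I}\{Z_{i,t} \to Z_{j,t} \text{ in } \mathcal{G}(\mathcal{Z}_t)\} = \delta \int h(z,z) \phi(z) \, dz. \tag{9}$$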


Proposition 1 in (Henze and Penrose, 1999) shows this when G is the minimum spanning tree 
based on properties obtained in (Aldous and Steele, 1992). The proof of Proposition 1 still borrows 
much from that of Proposition 1 in (Henze and Penrose, 1999), although it is simpler here because 
we work under more specific assumptions which (Henze and Penrose, 1999) verify for the minimum 
spanning tree along the way. 


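Proof. For any $a > 0$, we have

$$\frac{1}{t} \, \mathbb{E} \sum_{i=1}^{t} \sum_{j=1}^{t} h(Z_{i,t}, Z_{j,t}) \, \mathbb{I}\{Z_{i,t} \to Z_{j,t} \text{ in } \mathcal{G}(\mathcal{Z}_t)\} = \mathbb{E} \sum_{k=2}^{t} h(Z_{1,t}, Z_{k,t}) \, \mathbb{I}\{Z_{1,t} \to Z_{k,t} \text{ in } \mathcal{G}(\mathcal{Z}_t)\} = \mathfrak{A} + \mathfrak{B} + \mathfrak{C}, \tag{10}$$
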
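where

$$\begin{aligned}
\mathfrak{A} &:= \mathbb{E} \sum_{k=2}^{t} h(Z_{1,t}, Z_{1,t}) \, \mathbb{I}\{Z_{1,t} \to Z_{k,t} \text{ in } \mathcal{G}(\mathcal{Z}_t)\}, \\
\mathfrak{B} &:= \mathbb{E} \sum_{k=2}^{t} \big(h(Z_{1,t}, Z_{k,t}) - h(Z_{1,t}, Z_{1,t})\big) \, \mathbb{I}\{Z_{1,t} \to Z_{k,t} \text{ in } \mathcal{G}(\mathcal{Z}_t) \text{ and } \|Z_{1,t} - Z_{k,t}\| > a t^{-1/d}\}, \\
\mathfrak{C} &:= \mathbb{E} \sum_{k=2}^{t} \big(h(Z_{1,t}, Z_{k,t}) - h(Z_{1,t}, Z_{1,t})\big) \, \mathbb{I}\{Z_{1,t} \to Z_{k,t} \text{ in } \mathcal{G}(\mathcal{Z}_t) \text{ and } \|Z_{1,t} - Z_{k,t}\| \le a t^{-1/d}\}.
\end{aligned} \tag{11}$$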

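For the first term,

$$\begin{aligned}
\mathfrak{A} &= \mathbb{E}\, h(Z_{1,t}, Z_{1,t}) \, \Delta^{\rm out}(Z_{1,t}; \mathcal{G}(\mathcal{Z}_t)) \\
&\sim \delta \, \mathbb{E}\, h(Z_{1,t}, Z_{1,t}) = \delta \int h(z,z) \frac{\phi_t(z)}{\phi(z)} \phi(z) \, dz \to \delta \int h(z,z) \phi(z) \, dz, \quad t \to \infty.
\end{aligned} \tag{12}$$

Here we first used (6) and dominated convergence (enabled by the fact that $0 \le h \le 1$);
and then dominated convergence again (enabled by the fact that $0 \le h \le 1$ and $0 \le \phi_t/\phi \le 2$
eventually).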

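For the second term,

$$\begin{aligned}
|\mathfrak{B}| &\le \mathbb{E} \sum_{k=2}^{t} \mathbb{I}\{Z_{1,t} \to Z_{k,t} \text{ in } \mathcal{G}(\mathcal{Z}_t) \text{ and } \|Z_{1,t} - Z_{k,t}\| > a t^{-1/d}\} \\
&\le \delta_0 \, \mathbb{P}\big[\exists k \in [t] : Z_{1,t} \to Z_{k,t} \text{ in } \mathcal{G}(\mathcal{Z}_t) \text{ and } \|Z_{1,t} - Z_{k,t}\| > a t^{-1/d}\big],
\end{aligned} \tag{13}$$

using the fact that $0 \le h \le 1$ and, for the second inequality, the degree bound (5). Hence, $\lim_{a \to \infty} \lim_{t \to \infty} \mathfrak{B} = 0$ by (7).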

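For the third term, we have $|\mathfrak{C}| \le \int \psi_t(z) \phi_t(z) \, dz$, where

$$\psi_t(z) := \mathbb{E} \sum_{k=2}^{t} |h(z, Z_{k,t}) - h(z,z)| \, \mathbb{I}\{z \to Z_{k,t} \text{ in } \mathcal{G}(z, Z_{2,t}, \dots, Z_{t,t}) \text{ and } \|z - Z_{k,t}\| \le a t^{-1/d}\}. \tag{14}$$

Note that, by (5),

$$\psi_t(z) \le \mathbb{E}\, \Delta^{\rm out}(z; \mathcal{G}(z, Z_{2,t}, \dots, Z_{t,t})) \le \delta_0. \tag{15}$$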


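We now show that $\psi_t(z) \to 0$ as $t \to \infty$ for almost all $z$'s and for any fixed $a > 0$, which will imply
that $\lim_{t \to \infty} \mathfrak{C} = 0$ by dominated convergence and our assumption on $\phi_t$. Indeed, writing $B_t := B(z, a t^{-1/d})$, we have

$$\begin{aligned}
\psi_t(z) &\le (t-1) \, \mathbb{E}\, |h(z, Z_{2,t}) - h(z,z)| \, \mathbb{I}\{\|z - Z_{2,t}\| \le a t^{-1/d}\} \\
&= (t-1) \int_{B_t} |h(z,u)\phi_t(u) - h(z,z)\phi_t(z) + h(z,z)\phi_t(z) - h(z,z)\phi_t(u)| \, du \\
&\le t \int_{B_t} |h(z,u)\phi(u) - h(z,z)\phi(z)| \, du + t \int_{B_t} |\phi(z) - \phi(u)| \, du + 2t \int_{B_t} |\phi_t(u) - \phi(u)| \, du + 2 \omega_d a^d |\phi_t(z) - \phi(z)|,
\end{aligned} \tag{16}$$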


where $\omega_d$ denotes the volume of the unit ball in $\mathbb{R}^d$, and using the triangle inequality and the fact
that $h$ has values in $[0,1]$. Noting that the Lebesgue measure (the usual volume) of $B(z, a t^{-1/d})$
is equal to $\omega_d a^d / t$, we see that the first integral in the last line converges to zero as $t \to \infty$ when
$z$ is a Lebesgue continuity point of $h(z, \cdot)\phi(\cdot)$, and the same is true of the second integral if $z$ is a
Lebesgue continuity point of $\phi(\cdot)$. We note that almost all points $z$ satisfy these properties. This
follows from our assumptions on $h$ and the fact that almost all points are Lebesgue continuity
points of a given integrable function (including $\phi$). Moreover, because of the uniform convergence of
$\phi_t$ towards $\phi$, the third integral converges to zero, and the last term converges to zero for the same
reason.

Thus, by taking limits as $t \to \infty$ first and then as $a \to \infty$, we conclude. □


4 Almost sure convergence for matchings 

We now prove that certain kinds of matchings, including the matching (2), satisfy the conditions
of Theorem 1. Let $z = \{z_1, \dots, z_t\}$ be any set of points in $\mathbb{R}^d$. Since a matching results in all vertices
having exactly one neighbor, except for exactly one of them if $t$ is odd, the resulting graph $\mathcal{G}(z)$
has degree bounded by 1, so that (5) holds with $\delta_0 = 1$. Assume now $Z = \{Z_1, \dots, Z_t\}$ are IID from
a diffuse distribution on $\mathbb{R}^d$. Then $\Delta^{\rm out}(Z_1; \mathcal{G}(Z))$ follows a Bernoulli distribution with parameter
$1 - \frac{1}{t} \mathbb{I}\{t \text{ odd}\}$, so that (6) is satisfied with $\delta = 1$. It remains to establish (7), meaning that long
edges are rare, which is intuitively natural since matchings aim at minimizing pairing distances.

4.1 Optimal matchings 

We start with a matching which applies to $Z = \{Z_1, \dots, Z_t\} \subset \mathbb{R}^d$ and is of the form

$$\min_{\sigma} \Lambda(Z; \sigma), \qquad \Lambda(Z; \sigma) := \sum_{k=1}^{t} \lambda(\|Z_k - Z_{\sigma(k)}\|), \tag{17}$$

where $\lambda : [0, \infty) \to \mathbb{R}$ is non-decreasing and the minimization is as in (2). Examples include $\lambda(b) = b^\alpha$
where $\alpha > 0$, but one could imagine taking $\lambda$ such that $\lim_{b \to \infty} \lambda(b) < \infty$ for robustness. Below, we
assume that

$$\lambda(b) \asymp b^\alpha, \quad b \to 0, \quad \text{for some } \alpha \in (0, d). \tag{18}$$

This condition includes the case $\lambda(b) = b^\alpha$ when $\alpha \in (0, d)$, and in particular the original matching
(2) in dimension $d \ge 2$.

Proposition 2. Let $\mathcal{G}$ be the result of any matching of the form (17) with $\lambda$ satisfying (18). Then
$\mathcal{G}$ fulfills (7).

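Proof. Let $\phi, \phi_1, \phi_2, \dots$ be any density functions with same support such that $\phi_t/\phi \to 1$ uniformly
on $\{\phi > 0\}$. Let $\mathcal{Z}_t = \{Z_{1,t}, \dots, Z_{t,t}\}$ be IID with density $\phi_t$. Take $\varepsilon > 0$. Take $r > 0$ such that
$\int_A \phi(z) \, dz \ge 1 - \varepsilon$, where $A := [-r, r]^d$. By our assumptions, there is $t_0$ such that $\int_A \phi_t(z) \, dz \ge 1 - 2\varepsilon$ for all
$t \ge t_0$. Let $\hat\sigma \in \arg\min_{\sigma} \Lambda(\mathcal{Z}_t; \sigma)$ and define a matching $\tilde\sigma$ as follows. Let $K_t = \{k \in [t] : Z_{k,t} \notin A\}$.
For $k \in K_t \cup \hat\sigma(K_t)$, let $\tilde\sigma(k) = \hat\sigma(k)$. The indices $k \notin K_t \cup \hat\sigma(K_t)$ are matched between themselves
as follows. Let $s_0$ be the largest integer such that $2^{s_0 d} \le t$. Consider a regular partition of $A$ into
$2^{s_0 d}$ hypercubes, each of side length $2r 2^{-s_0}$, and therefore diameter $2^{-s_0} 2r\sqrt{d}$. Match points within
each bin arbitrarily, for example, by solving (17) within that bin. Note that each pair contributes
at most $\lambda(2^{-s_0} 2r\sqrt{d})$ to $\Lambda(\mathcal{Z}_t; \tilde\sigma)$. Remove the matched points. Since this leaves at most one point
per bin, there are at most $2^{s_0 d}$ points left. Form a regular partition of $A$ into $2^{(s_0 - 1) d}$ hypercubes
and match points within each bin. Each pair now contributes at most $\lambda(2^{-(s_0 - 1)} 2r\sqrt{d})$ and, after
removing the matched points, there are at most $2^{(s_0 - 1) d}$ points left. Continuing in this fashion
defines $\tilde\sigma$ and we have

$$\begin{aligned}
\Lambda(\mathcal{Z}_t; \tilde\sigma) \le{} & \sum_{k \in K_t \cup \hat\sigma(K_t)} \lambda(\|Z_{k,t} - Z_{\hat\sigma(k),t}\|) \\
& + t \lambda(2^{-s_0} 2r\sqrt{d}) + 2^{s_0 d} \lambda(2^{-(s_0 - 1)} 2r\sqrt{d}) + 2^{(s_0 - 1) d} \lambda(2^{-(s_0 - 2)} 2r\sqrt{d}) + \cdots + 2^{d} \lambda(2r\sqrt{d}).
\end{aligned} \tag{19}$$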
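By the fact that $\hat\sigma$ minimizes (17),

$$\Lambda(\mathcal{Z}_t; \tilde\sigma) \ge \Lambda(\mathcal{Z}_t; \hat\sigma) = \sum_{k \in K_t \cup \hat\sigma(K_t)} \lambda(\|Z_{k,t} - Z_{\hat\sigma(k),t}\|) + \sum_{k \notin K_t \cup \hat\sigma(K_t)} \lambda(\|Z_{k,t} - Z_{\hat\sigma(k),t}\|). \tag{20}$$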


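By our condition (18) on $\lambda$, there is $C \ge c > 0$ such that

$$c b^\alpha \le \lambda(b) \le C b^\alpha \quad \text{for all } b \in [0, 1]. \tag{21}$$

Let $j_r = \min\{j : 2^{-j} 2r\sqrt{d} \le 1\}$. Below, $B_1, B_2, \dots$ are positive functions of $\varepsilon, d, \alpha, c, C$, and recall
that $r$ is a function of $\varepsilon$. With $2^{s_0 d} \le t < 2^{(s_0 + 1) d}$, combining (19) and (20), we get

$$\begin{aligned}
\sum_{k \notin K_t \cup \hat\sigma(K_t)} \lambda(\|Z_{k,t} - Z_{\hat\sigma(k),t}\|) &\le \sum_{j = j_r}^{s_0} 2^{(j+1) d} C (2^{-j} 2r\sqrt{d})^\alpha + \sum_{j = 0}^{j_r - 1} 2^{(j+1) d} \lambda(2^{-j} 2r\sqrt{d}) \\
&\le B_2 2^{(d - \alpha) s_0} + B_1 \le B_2 t^{1 - \alpha/d} + B_1 \le 2 B_2 t^{1 - \alpha/d},
\end{aligned} \tag{22}$$

when $t \ge B_3 := (B_1/B_2)^{d/(d - \alpha)}$. For $a > 0$, let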

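$$L_{a,t} = \{k \notin K_t \cup \hat\sigma(K_t) : \lambda(\|Z_{k,t} - Z_{\hat\sigma(k),t}\|) > \lambda(a t^{-1/d})\}. \tag{23}$$

Using (21), we have

$$\sum_{k \notin K_t \cup \hat\sigma(K_t)} \lambda(\|Z_{k,t} - Z_{\hat\sigma(k),t}\|) \ge |L_{a,t}| \, \lambda(a t^{-1/d}) \ge |L_{a,t}| \, c \, (a t^{-1/d})^\alpha, \tag{24}$$

when $t \ge a^d$, in which case, by combining (22) and (24), it follows that $|L_{a,t}|/t \le \frac{2 B_2}{c} a^{-\alpha} =: B_4 a^{-\alpha}$.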
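We have

$$\mathbb{P}_t(\|Z_{1,t} - Z_{\hat\sigma(1),t}\| > a t^{-1/d}) \le \mathbb{P}_t(1 \in K_t \cup \hat\sigma(K_t)) + \mathbb{P}_t(1 \in L_{a,t}), \tag{25}$$

where we have used the fact that $\lambda$ is non-decreasing by assumption. On the one hand,

$$\mathbb{P}_t(1 \in L_{a,t}) = \frac{1}{t} \sum_{i=1}^{t} \mathbb{P}_t(i \in L_{a,t}) = \frac{1}{t} \sum_{i=1}^{t} \mathbb{E}_t[\mathbb{I}\{i \in L_{a,t}\}] = \frac{1}{t} \mathbb{E}_t[|L_{a,t}|] \le B_4 a^{-\alpha}. \tag{26}$$

On the other hand, by construction of $A$, and then $K_t$,

$$\mathbb{P}_t(1 \in K_t \cup \hat\sigma(K_t)) \le \mathbb{P}_t(1 \in K_t) + \mathbb{P}_t(\hat\sigma(1) \in K_t) \le 2(2\varepsilon) = 4\varepsilon. \tag{27}$$

Thus, for $a \ge (B_4/\varepsilon)^{1/\alpha}$ and $t \ge t_0 \vee B_3 \vee a^d$,

$$\mathbb{P}_t(\|Z_{1,t} - Z_{\hat\sigma(1),t}\| > a t^{-1/d}) \le 5\varepsilon. \tag{28}$$

Since $\varepsilon$ is arbitrary, we can conclude that (7) holds.

□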
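
The comparison matching built in the proof of Proposition 2 can be sketched in code. The following Python function pairs points of $[-r, r]^d$ within the cells of successively coarser regular partitions, starting at level $s_0$ and ending with the single cell $A$. The name `dyadic_matching` is hypothetical, points are assumed to be coordinate tuples lying in $[-r, r]^d$, and pairing within a cell is done arbitrarily (in order of appearance) instead of by solving (17) within the cell.

```python
def dyadic_matching(points, r):
    """Pair points of [-r, r]^d cell by cell over coarser and coarser partitions."""
    t, d = len(points), len(points[0])
    s0 = 0
    while 2 ** ((s0 + 1) * d) <= t:  # largest s0 such that 2^(s0 * d) <= t
        s0 += 1
    unmatched = list(range(t))
    pairs = []
    for s in range(s0, -1, -1):
        side = 2.0 * r / 2 ** s  # side length of the cells at level s
        cells = {}
        for k in unmatched:
            # Cell coordinates of point k; points on the upper boundary are clipped.
            cell = tuple(min(int((x + r) / side), 2 ** s - 1) for x in points[k])
            cells.setdefault(cell, []).append(k)
        unmatched = []
        for members in cells.values():
            # Pair the members of a cell two at a time.
            for a in range(0, len(members) - 1, 2):
                pairs.append((members[a], members[a + 1]))
            # At most one point per cell is carried over to the next level.
            if len(members) % 2 == 1:
                unmatched.append(members[-1])
    return pairs, unmatched
```

A pair formed at level $s$ lies in a common cell of diameter $2^{-s} 2r\sqrt{d}$, which is the bound that enters (19); after the last level at most one point remains unmatched.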


We may thus apply Theorem 1 to obtain the following. 


Corollary 1. Let $\mathcal{G}$ be the result of any optimal matching of the form (17) with $\lambda$ satisfying (18).
Then, in the limiting regime (4),

$$\frac{\chi_{\mathcal{G}}(Z)}{m+n} \to \int \frac{2p(1-p) f(z) g(z)}{p f(z) + (1-p) g(z)} \, dz \quad \text{almost surely}. \tag{29}$$
4.2 Greedy matching

By greedy matching we mean the following procedure: match the closest pair of points (in Euclidean
distance), remove them from the sample, and repeat. (Ties are broken arbitrarily.) Let $\sigma_0$ denote
this greedy matching. This can be seen as a greedy minimization strategy for (17). While a typical
implementation of (17) has complexity $O(t^3)$, this greedy procedure has complexity $O(t^2 \log t)$.

See the discussion in (Avis et al., 1988). There, it is proved that, if $Z_1, \dots, Z_t$ are IID from a
distribution with compact support on $\mathbb{R}^d$ (with $d \ge 2$),

$$\sum_{k \in [t]} \|Z_k - Z_{\sigma_0(k)}\| = O(t^{1 - 1/d}). \tag{30}$$

(In fact, their result is much sharper.) Although this result is obtained when the underlying
distribution does not change with $t$, it can be seen to easily extend to the setting of Theorem 1 as
long as $\phi$ has compact support.
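
The greedy procedure described at the start of this subsection can be sketched as follows. The function name `greedy_matching` is an assumption of this example, points are taken to be coordinate tuples, and ties are broken by index order. Sorting all pairwise distances once and scanning them in increasing order gives the same result as repeatedly extracting the closest remaining pair.

```python
import math


def greedy_matching(points):
    """Greedy matching: repeatedly pair the two closest points still unmatched."""
    t = len(points)
    # All candidate pairs sorted by Euclidean distance; ties fall back to index order.
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(t)
        for j in range(i + 1, t)
    )
    matched = set()
    pairs = []
    for _, i, j in candidates:
        # The first admissible pair in sorted order is the closest remaining pair.
        if i not in matched and j not in matched:
            pairs.append((i, j))
            matched.update((i, j))
    return pairs
```

On the four points 0, 1, 1.5 and 3 of the real line, the greedy matching pairs 1 with 1.5 and then 0 with 3 (total length 3.5), whereas the optimal matching pairs 0 with 1 and 1.5 with 3 (total length 2.5), so the two constructions can differ.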

With this in place, we can reason as we did in Section 4.1 to find that the graph $\mathcal{G}$ resulting from
the greedy matching $\sigma_0$ also satisfies (7). We may thus apply Theorem 1 to derive the following.

Corollary 2. Let $\mathcal{G}$ be the result of the greedy matching. Then, assuming that the underlying
distributions $f$ and $g$ have compact support, and in the limiting regime (4), the limit (29) holds.


5 Consistency against all (fixed) alternatives 

We start with describing the behavior of the test statistic $\chi_{\mathcal{G}}(Z)$ under the null hypothesis where
$f = g$. Rosenbaum (2005) derives the null distribution of $\chi_{\mathcal{G}}$ in closed form (which happens to be
equal to its permutation distribution) and finds the exact moments to be^2

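$$\mathbb{E}_0\big[\chi_{\mathcal{G}}(Z)\big] = \frac{2mn}{t-1} \sim 2p(1-p)\, t, \qquad \operatorname{Var}_0\big[\chi_{\mathcal{G}}(Z)\big] = \frac{8\, m(m-1)\, n(n-1)}{(t-3)(t-1)^2} \sim 8 p^2 (1-p)^2\, t, \tag{31}$$

where the limits are as $t \to \infty$ in the regime (4). Clearly, this is true for any matching. To prove
consistency, this is all we need together with Corollary 1 (or Corollary 2), although Rosenbaum
(2005) also shows that the null distribution is asymptotically normal.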

Corollary 3. Let $\mathcal{G}$ be the result of any matching of the form (17) with $\lambda$ satisfying (18). For any
sequence $\eta_t \to 0$ such that $\eta_t \gg 1/\sqrt{t}$, the test with rejection region $\{\frac{1}{t} \chi_{\mathcal{G}}(Z) \le 2p(1-p) - \eta_t\}$ is
consistent against all (fixed) alternatives. This remains true if $\mathcal{G}$ is the result of the greedy matching
and the underlying distributions have compact support.

Proof. By Chebyshev's inequality and the expression for the null moments (31), we see that the
test has asymptotic size 0. Under a fixed alternative, $f \neq g$, Corollary 1 gives

$$\frac{1}{t} \chi_{\mathcal{G}}(Z) \to 2p(1-p) \int \frac{f(z) g(z)}{p f(z) + (1-p) g(z)} \, dz \quad \text{almost surely}. \tag{32}$$

As Henze and Penrose (1999) argue, based on work of Györfi and Nemetz (1975), this limit is
strictly smaller when $f \neq g$ than when $f = g$ (the latter being equal to $2p(1-p)$). Hence, the
test has asymptotic power 1. □
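
The gap between the limit under an alternative and the null value $2p(1-p)$ can be checked numerically in a simple case. The sketch below evaluates the limit with a midpoint rule in dimension one, taking $f$ and $g$ to be unit-variance normal densities; this one-dimensional Gaussian setting, the integration range, and the function names are illustrative assumptions and are not part of the analysis above.

```python
import math


def normal_pdf(z, mean):
    """Density of the normal distribution with the given mean and unit variance."""
    return math.exp(-0.5 * (z - mean) ** 2) / math.sqrt(2.0 * math.pi)


def crossmatch_limit(p, mean_f, mean_g, lo=-12.0, hi=12.0, steps=20000):
    """Midpoint-rule approximation of 2p(1-p) * integral of f g / (p f + (1-p) g)."""
    width = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        z = lo + (s + 0.5) * width
        f, g = normal_pdf(z, mean_f), normal_pdf(z, mean_g)
        mix = p * f + (1.0 - p) * g
        if mix > 0.0:  # guard against division by zero far in the tails
            total += f * g / mix * width
    return 2.0 * p * (1.0 - p) * total
```

When the two means coincide the function returns approximately $2p(1-p)$, and separating the means makes the value strictly smaller, in line with the argument used in the proof of Corollary 3.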


^2 Note that in our definition there is an extra factor of 2.